Using blood routine indicators to establish a machine learning model for predicting liver fibrosis in patients with Schistosoma japonicum

This study aimed to use the basic information and routine blood test results of schistosomiasis patients to establish a machine learning model for predicting liver fibrosis. We collected medical records of Schistosoma japonicum patients admitted to a hospital in China from June 2019 to June 2022. Key variables were screened, and six different machine learning algorithms were used to establish prediction models. The models were compared on AUC, specificity, sensitivity and other indicators, and the best-performing model was selected for further modeling. The model was interpreted using the SHAP package. A total of 1049 patients’ medical records were collected, and 10 key variables were selected for modeling using the lasso method: red cell distribution width-standard deviation (RDW-SD), mean corpuscular hemoglobin concentration (MCHC), mean corpuscular volume (MCV), hematocrit (HCT), red blood cells, eosinophils, monocytes, lymphocytes, neutrophils, and age. Among the six machine learning algorithms, LightGBM performed best, with AUCs of 1 and 0.818 in the training set and validation set, respectively. This study established a machine learning model for predicting liver fibrosis in patients with Schistosoma japonicum. The model could help improve early diagnosis and support early intervention for schistosomiasis patients with liver fibrosis.


Best model
After comparing multiple models, we found that LightGBM performed best and used it for modeling. The AUC was 0.995 in the training set, 0.804 in the validation set, and 0.8367 in the test set (Fig. 2A-C). During cross-validation, the model reached a stable state once the sample size of the training set and the validation set reached 400 (Fig. 2D). Supplementary Tables 2-4 show the model evaluation metrics on the training set, validation set, and test set, respectively.
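As an illustration of the evaluation metrics reported here, the following minimal sketch computes AUC, sensitivity and specificity from predicted probabilities. It is not the study's code: the function names, the 0.5 threshold, and the labels and scores are all hypothetical, chosen only to show how the quantities are defined.

```python
from typing import Sequence, Tuple


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC as the probability that a random positive outranks a random negative.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(y_true: Sequence[int],
                            y_score: Sequence[float],
                            threshold: float = 0.5) -> Tuple[float, float]:
    """Sensitivity (true positive rate) and specificity (true negative rate)."""
    pairs = list(zip(y_true, y_score))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical labels (1 = fibrosis) and predicted probabilities, for illustration only.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]

auc = roc_auc(labels, scores)
sens, spec = sensitivity_specificity(labels, scores, threshold=0.5)
print(f"AUC={auc:.4f}, sensitivity={sens:.2f}, specificity={spec:.2f}")
```

A gap between training and validation AUC of the size reported above is the kind of difference this pairwise definition makes easy to inspect: the validation AUC falls whenever patients without fibrosis receive higher scores than patients with fibrosis.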

Model interpretability
The SHAP diagram in Fig. 3A shows how each variable in the validation set contributes to the prediction of liver fibrosis. The redder a point, the larger the value of the variable; the bluer the point, the smaller the value. On the vertical axis, the larger the absolute value of a negative number, the more likely the predicted result is negative, and the larger the absolute value of a positive number, the more likely the predicted result is positive. For example, the larger the RDW-SD value, the greater the likelihood of liver fibrosis, whereas patients with higher lymphocyte and neutrophil counts have a lower likelihood of liver fibrosis. Figure 3B shows the importance ranking of each variable: RDW-SD, lymphocytes and neutrophils are the more important variables. Figure 3C and Fig. 3D use two force diagrams to show how the variables of two samples affect the results. In Fig. 3C, the patient was predicted to have liver fibrosis and actually had it. The longest red arrow is neutrophils (0.93), indicating that neutrophils made the largest positive contribution to the outcome for this patient; the second largest positive contribution came from red blood cells (3.69). No variable made a negative contribution to the outcome. In Fig. 3D, the patient was predicted not to have liver fibrosis, and indeed did not have it. The three variables with the most positive impact were neutrophils (1.71), red blood cells (3.47), and age (77.0); the two variables with the most negative impact on the outcome were RDW-SD (42.7) and MCV (98.3).
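To make the idea behind these per-variable contributions concrete, the sketch below computes exact Shapley values for a toy linear score by enumerating all coalitions of features. It is an illustration only and does not reproduce the SHAP package or the LightGBM model: the function `exact_shapley`, the score `toy_score`, its coefficients, the baseline and the patient values are all hypothetical.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict


def exact_shapley(model: Callable[[Dict[str, float]], float],
                  sample: Dict[str, float],
                  baseline: Dict[str, float]) -> Dict[str, float]:
    """Exact Shapley values; features outside a coalition take their baseline value."""
    names = list(sample)
    n = len(names)

    def value(coalition: set) -> float:
        mixed = {k: (sample[k] if k in coalition else baseline[k]) for k in names}
        return model(mixed)

    phi = {}
    for feature in names:
        others = [k for k in names if k != feature]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude the feature.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_feature = value(set(subset) | {feature})
                without_feature = value(set(subset))
                total += weight * (with_feature - without_feature)
        phi[feature] = total
    return phi


def toy_score(x: Dict[str, float]) -> float:
    """Hypothetical risk score; coefficients are invented for illustration."""
    return 0.04 * x["rdw_sd"] - 0.30 * x["lymphocytes"] + 0.01 * x["age"]


# Hypothetical reference patient and patient of interest.
baseline = {"rdw_sd": 43.0, "lymphocytes": 1.8, "age": 55.0}
patient = {"rdw_sd": 48.0, "lymphocytes": 1.0, "age": 70.0}

contributions = exact_shapley(toy_score, patient, baseline)
for name, phi in contributions.items():
    print(f"{name}: {phi:+.3f}")
```

The contributions sum to the difference between the patient's score and the baseline score, which is the property a force diagram visualizes: positive contributions push the prediction toward a positive outcome and negative contributions push it away.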

Discussion
After infecting the host, Schistosoma japonicum produces a large number of eggs and deposits them in tissues such as the liver. Without timely and effective intervention, changes such as egg granuloma and liver fibrosis may further develop into hepatocellular carcinoma [7]. Studies have shown that liver fibrosis is not a single irreversible progression and may have the potential to regress [8]. Early diagnosis and treatment of liver fibrosis are therefore of practical significance. At present, schistosomiasis has not attracted enough attention in the major endemic countries, so clinical and basic research on schistosomiasis lags behind, and few basic data are available on schistosomiasis liver fibrosis [9]. This study predicts the risk of liver fibrosis by constructing a diagnostic model, which has important clinical significance for early and correct treatment and intervention.
This study uses a machine learning model to predict liver fibrosis in schistosomiasis japonica, helping clinicians understand the impact of key factors on liver fibrosis. It supports early identification of liver fibrosis and assessment of its severity, so that patients with early liver fibrosis can be detected in time and their prognosis improved. In this study, the data of 1049 patients with schistosomiasis japonica were analyzed to establish a liver fibrosis prediction model using machine learning algorithms to help identify patients. Overall, the key variables included in the model may play an important role in the early diagnosis of Schistosoma japonicum liver fibrosis. Previous reports point out a close relationship between routine blood indicators and liver fibrosis [10], and the results of this study support this association.
The neutrophil-to-lymphocyte ratio (NLR) is widely used to assess inflammatory diseases. In patients with nonalcoholic fatty liver disease (NAFLD), NLR was significantly correlated with liver fibrosis stage and the nonalcoholic fatty liver disease activity score (NAS); in chronic hepatitis B (CHB) patients, NLR was negatively correlated with liver fibrosis stage [11-14]. Therefore, NLR may be associated with the stage of liver fibrosis. Kekilli et al. also demonstrated that NLR reflects the severity of advanced liver fibrosis [15].
RDW is a parameter reflecting the heterogeneity of red blood cell volume. It is often used to diagnose different types of anemia and is closely related to the body's inflammation and nutritional status. Elevated RDW often indicates a shortened lifespan and increased destruction of red blood cells. Michalak et al. believe that RDW and its derivatives may be related to the deterioration of liver function [16]. Studies have shown that RDW is closely related to liver fibrosis in diseases such as NAFLD and CHB [17-19]. RDW can be expressed as RDW-CV and RDW-SD. RDW-SD is the width of the red blood cell volume distribution curve measured at the 20% height level above baseline. Studies have shown [20] that RDW-SD is closely related to significant liver fibrosis (F2-F4) in CHB and can be used as an effective predictor of significant liver fibrosis in CHB. Liu et al. [21-23] also found that only RDW-SD differed significantly between stages of liver fibrosis in AIH (P = 0.046). In univariate logistic regression analysis, RDW-SD was a risk factor for advanced liver fibrosis (F3-F4) in AIH.
MCV is a parameter that reflects the volume of red blood cells, and changes in MCV suggest that the patient's hemoglobin synthesis is impaired. Liu et al. [21] further found that MCV differed significantly among stages of liver fibrosis in AIH and was positively correlated with the severity of liver fibrosis. The combination of MCV and RDW can comprehensively reflect the dispersion of peripheral red blood cell volume. So far, the mechanism linking RDW, MCV and liver fibrosis is unclear, and may include the following points: (1) inflammatory cytokines may inhibit the maturation of red blood cells and accelerate the entry of newer and larger reticulocytes into the peripheral circulation, resulting in increased RDW; (2) patients with liver disease often have decreased intestinal absorption, leading to deficiencies of folic acid, vitamin B12 and other nutrients, and in turn to varying degrees of megaloblastic anemia and heterogeneous changes in red blood cell volume; (3) hepatic fibrosis often causes splenomegaly and hypersplenism, which accelerate red blood cell destruction and shorten the lifespan of red blood cells, which may promote the release of immature red blood cells and eventually lead to increased RDW [17,24,25].
These studies provide a theoretical basis for the correlation between routine blood indicators and liver fibrosis, but they have not clearly quantified the magnitude of the correlation or the degree of liver function deterioration, nor have they offered a means of predicting early liver fibrosis. Machine learning can make up for this deficiency. This study also found that age is a key variable associated with liver fibrosis in schistosomiasis japonica: the model predicts that the older the patient, the greater the likelihood of liver fibrosis. The significance of the machine learning method for this study lies in establishing a clinical prediction and identification model from simple routine blood indicators and patient age to support the diagnosis of complex liver fibrosis. This study built a machine learning model and evaluated it by taking advantage of abundant data. Compared with the models in the published literature, this study needs only routine blood results, age and gender to make predictions, providing clinicians with a diagnostic method that is easier to operate and understand.
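The two derived blood-count indices discussed above can be written out explicitly. The sketch below is illustrative only: the function names and input values are hypothetical, and it assumes the standard definition of RDW-CV as the standard deviation of red cell volume divided by MCV, times 100. Note that this standard deviation is a statistical quantity and is not the same as RDW-SD, which is a width read off the volume distribution curve.

```python
def neutrophil_lymphocyte_ratio(neutrophils: float, lymphocytes: float) -> float:
    """NLR: absolute neutrophil count divided by absolute lymphocyte count.

    Both counts must use the same unit (for example 10^9 cells per litre).
    """
    if neutrophils < 0 or lymphocytes <= 0:
        raise ValueError("neutrophils must be >= 0 and lymphocytes must be > 0")
    return neutrophils / lymphocytes


def rdw_cv_percent(volume_sd_fl: float, mcv_fl: float) -> float:
    """RDW-CV (%): standard deviation of red cell volume over MCV, times 100.

    volume_sd_fl is the statistical SD of red cell volume in femtolitres,
    which is an assumption of this sketch and is distinct from RDW-SD.
    """
    if volume_sd_fl < 0 or mcv_fl <= 0:
        raise ValueError("volume_sd_fl must be >= 0 and mcv_fl must be > 0")
    return 100.0 * volume_sd_fl / mcv_fl


# Hypothetical values, for illustration only.
nlr = neutrophil_lymphocyte_ratio(neutrophils=3.6, lymphocytes=1.2)
rdw_cv = rdw_cv_percent(volume_sd_fl=12.0, mcv_fl=90.0)
print(f"NLR={nlr:.2f}, RDW-CV={rdw_cv:.1f}%")
```

Because RDW-CV is normalized by MCV, it changes when MCV changes even if the spread of cell volumes stays the same, whereas RDW-SD is reported directly in femtolitres.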
This study also has certain limitations. It is a single-center retrospective study, and some of the results discussed pertain to individual patients, so inherent selection bias and information bias may be unavoidable. The next step is multi-center prospective research for external validation, to further improve and promote this machine learning model. The variables of the current model include only the patient's clinical information and test results. To optimize the performance of the identification model, biomarkers from the microbiome and metabolomics could also be included. At present, however, using only clinical variables reduces the burden on patients to a certain extent and is convenient in clinical application. Finally, the insufficient interpretability of SHAP values warrants the development of more understandable models in the future. We will further develop an automatic clinical scoring system based on nomograms or machine learning, using research data, in order to provide clinicians with more practical and easy-to-understand tools.

Study population
The study population consisted of patients diagnosed with Schistosoma japonicum in Yueyang, Hunan Province, China. This city has historically been a high schistosomiasis epidemic area because it is located near Dongting Lake in the middle and lower reaches of the Yangtze River, where the intermediate host Oncomelania hupensis breeds in large numbers.
Schistosoma japonicum infection was diagnosed according to the definition of Zhou et al. [26], which includes the following diagnostic criteria: life history in schistosomiasis-endemic areas, contact with infected water, specific Schistosoma serology testing, color ultrasound, and microscopic examination of excreta (feces, urine). Schistosomiasis infection was considered present when schistosome ova were visualized in stool or urine, or when the Schistosoma serology was positive.
Liver fibrosis was determined by ultrasound according to the World Health Organization diagnostic criteria for Schistosoma japonicum infection [27,28]. An experienced ultrasound expert divided the patients into two groups according to the ultrasound results: a fibrosis group (with mesh-like changes and uneven hepatic echotexture) and a no-fibrosis group (without mesh-like changes, with smooth and uniform hepatic echotexture). The diagnosis was double-checked by another experienced schistosomiasis specialist.

Data collection
A retrospective medical record review was conducted from June 2019 to June 2022 at Xiangyue Hospital, Yueyang City, Hunan Province, China. All patients underwent blood tests and ultrasound evaluation at admission. All variables were extracted from the hospital's electronic medical record system. The data include: patient

Figure 1. Multi-model comparison diagram. (A) AUC of multiple models in the training set. Each color represents a machine learning algorithm. (B) AUC of multiple models in the validation set.

Figure 2. AUC of the LightGBM model. (A) AUC of the LightGBM model in the training set. (B) AUC of the LightGBM model in the validation set. (C) AUC of the LightGBM model in the test set. (D) Change in the AUC of the LightGBM model with training sample size. The abscissa represents the number of samples, and the ordinate represents the AUC value.

Figure 3. Interpretability of the model. (A) SHAP diagram. Each point represents a sample. The redder the color of the point, the larger the value of the variable, and the bluer the color, the smaller the value of the variable. The larger the ordinate of the point, the more likely the outcome is to be positive. (B) Importance ranking of key variables. The abscissa is the absolute value of the SHAP value, and the ordinate is the key variable. (C) A sample with a positive outcome. Red indicates a positive contribution to a positive outcome, and blue indicates a negative contribution to a positive outcome. The length of the bar indicates the size of the contribution: the longer the bar, the greater the contribution to the outcome. (D) A sample with a negative outcome.

, median (IQR) All (n = 1049) Non-fibrosis group (n = 768) Fibrosis group (n = 281) P-value
the indicators with the largest negative contributions are RDW-SD and MCV. Except for the patient's age, all other indicators come from the routine blood test.

Table 2. Multi-model classification results on the validation set.